knitr::opts_chunk$set(echo=TRUE, message = FALSE, warning = FALSE, cache = TRUE, fig.width = 5, dpi = 150, autodep = TRUE, tidy = TRUE, tidy.opts=list(blank=TRUE, width.cutoff=60))
The first process in the tool kit is to assess the data. This involves identifying and excluding obvious outliers and for the linear regression methods identifying linear portions of the data. There are many ways of detecting outliers and any statistical handbook will provide examples, however in the tool kit three approaches are used, a text summary, box and scatter plots.
An example script TKit_check_data.R, with an example data file DataTempate_Example1.csv
First clear the R workspace
rm(list= ls())
Data are entered into an R object called a data.frame, in the example scripts the name of this object is data. First enter the data file name, and if necessary a path to that data file, which is only necessary if the data file is not in the operating directory that R is using.
FName<-"DataTemplate_Example1.csv"
(Note when entering a path into R use double \ to separate directories as in example below)
FName <- system.file("extdata/DataTemplate_Example1.csv", package = "ecostattoolkit")
Now read the data into a data.frame called data
data <- read.csv(file = FName, header = TRUE)
The data can be quickly checked using R’s generic summary function which returns the minimum and maximum value, 25th, 50th (median), 75th quantiles and the mean for each variable. It also givens the number of missing values (NA). The maximum and minimum values may give an indication of potential outliers. The R function dim also allows the size of the data.frame to be checked (number rows or records, number of columns or variables).
summary(data) dim(data)
Using the example data file, note very high maximum TP and TN values in comparison to the interquartile range. There are 273 records with 7 columns of data.
Boxplots can be used to detect outliers, by default R’s generic boxplot function classifies an observation as an outlier when its value reaches more than 1.5 times the interquartile range. For our purposes the most appropriate approach is to look for P and then N outlier values by biological class (Figure A8.). (Including $out in the boxplot function prints the outlier values in addition to producing the plot, labels are added for the y axis to aid interpretation).
par(mfrow = c(1,2)) boxplot(data$P ~ as.factor(data$BioClass),ylab="P conc")$out boxplot(data$N ~ as.factor(data$BioClass),ylab="N conc")$out par(mfrow = c(1,1))
If the data set contain a grouping variable, such as water body type or country it may be helpful to look for outliers in different groups (parameter las = 3 is used to rotate text to make it readable). Note that with the example data there are too many groups (countries) to provide a useful output.
boxplot(data$P ~ as.factor(data$BioClass)+as.factor(data$Group), ylab="P",las=3)$out
Scatter plots are an alternative and potentially more informative way of identifying outliers as they show a continuous relationship between EQR and TP or TN.
The R function identify can be used to interact with the plot and label the points that appear to be outliers; the identify function requires the x variable, y variable and a variable to print to be specified. After running the line including the identify function click on the plot near a point that could be an outlier and the Record number will be printed. To exit the interactive mode right click and select Stop. An example output is shown (Figure A9)
plot(data$EQR ~ data$P,log="x")#note the nutrient axis is set to a log scale identify(data$P,data$EQR,data$Record)
plot(data$EQR ~ data$P,log="x") ol <- subset(data, Exclude_P == T) text(ol$P, ol$EQR, ol$Record, pos = 1)
Note care should be taken when selecting outliers as there is a temptation to remove or “prune” too many values which is not always appropriate. Having identified possible outliers mark these by editing the original data set by setting the value in field Exclude = 1, alternatively delete the records from the data set, although this does not provide an audit trail or the opportunity to check the potential influence of the outliers on subsequent analysis. To maintain an audit trail and increase flexibility the took kit scripts assume that outliers are marked rather than deleted, this would allow comparisons of results if outliers were changed in subsequent analyses.
If linear regression analysis is to be used it is necessary to identify linear regions of the data. The approach used in the tool kit is to examine a scatter plot and to fit a gam model to the data to check the shape of the relationship. Assuming this indicates that the relationship is not linear it then fits a segmented regression to identify where there are significant changes in slope of the relationship.
An example script TKit_check_linearity.R, with an example data file DataTemplate_Example1.csv
For this analysis, we need to load the mgcv and segmented R libraries
library(mgcv)# gam modelling library(segmented)# segmented regression
Next load the data, which either has had outliers removed or marked. (note the field Exclude is still required for script to run, even if all outliers have been removed, i.e. all records marked as Exclude = 0, as the variable name Exclude is included in the R scripts as a selection criteria)
#Enter file name and path FName <- system.file("extdata/DataTemplate_Example1.csv", package = "ecostattoolkit") #Get data and check data <- read.csv(file = FName, header = TRUE) dim(data)
As many data sets will contain missing values the data are checked and a second data.frame called data.cc is created with no missing values, although in the example data set there are no missing values.
cc<-complete.cases(data$EQR,data$P) data.cc<-data[cc,] dim(data.cc)
To make the remaining script simpler the variable names used in the data file are assigned to simpler standard variables which are used for the remainder of the script. Two sets of variables are used, the first contain all the data x, y, z and a second set, which are used for modelling, have had the outliers excluded x.u, y.u, z.u. (note below that when selecting data R requires the use of == rather than =). For plotting, the number of categories (zN) of the Group variable is required so that each is presented in a different colour.
As the relationship between biological status and nutrients is nearly always logarithmic, nutrient values are log10 transformed. (The following example script is for TP, for TN edit the second line changing TP for TN)
#All data including outliers (for information) x<-log10(data.cc$P) # log10 transform of independent variable x2<-10^x # back transformed value for plotting on linear scale y<-data.cc$EQR z<-as.factor(data.cc$Group) # needs to be a factor to act as categorical variable zN<-length(levels(z)) #number of categories in data
#Data used for modelling, excluding outliers x.u<-log10(data.cc$P[data.cc$Exclude_P==0]) # selects records where Exclude_P = 0 x2.u<-10^x.u y.u<-data.cc$EQR[data.cc$Exclude_P==0] z.u<-as.factor(data.cc$Group[data.cc$Exclude_P==0]) zN<-length(levels(z)) #number of categories in data
First set up the axis ranges and provide meaningful axis labels. Edit the values in the lines below to match your data, remember that text for labels needs to be enclosed by “ “.
xmin<-1 xmax<-1000 ymin<-0 ymax<-1.5 xlb<-expression(paste("total phosphorus ","(µg ", L^-1,")")) ylb<-"EQR"
Plot all the data (including outliers) as open circles (not shown)
plot(y~x2,log="x",ylim=c(ymin,ymax),xlim=c(xmin,xmax),xlab=xlb,ylab=ylb,cex=1)
Then colour the data that are not outliers by adding these as points. Colours are determined by the Group variable which was assigned to the object z.u. A legend is added at the top right of the plot, to change the legend position edit the 3rd line of script below using alternatives “topleft”,”bottomleft”, “bottomright”. Note that the size of legend text is determined by cex.
points(y.u~x2.u,col=rainbow(zN)[z.u],pch=19) legend("topright",levels(z.u),col=rainbow(zN),pch=19,cex=0.8)
Fit a gam model and add fitted line to the plot, the first line of script can be edited to change the value of k if necessary. This sets the degree of smooth applied by changing the number of knots used
k<-9 # number of knots for smoother, decrease value to produce a smoother curve fit.gam<- gam(y.u~s(x.u,k=k)) print(summary(fit.gam))
To plot the figure:
new.x.u <- data.frame(x.u = seq(min(x.u, na.rm=T), max(x.u, na.rm=T), length = length(x.u))) pred.gam <- predict(fit.gam, newdata=new.x.u, se.fit=TRUE) ylower <- pred.gam$fit - pred.gam$se.fit yupper <- pred.gam$fit + pred.gam$se.fit plot(y~x2,log="x",ylim=c(ymin,ymax),xlim=c(xmin,xmax),xlab=xlb,ylab=ylb,cex=1) points(y.u~x2.u,col=rainbow(zN)[z.u],pch=19) legend("topright",levels(z.u),col=rainbow(zN),pch=19,cex=0.8) lines(10^(new.x.u$x.u),(pred.gam$fit), col="black") mtext("GAM", side = 3, adj=0,line =0, col="black",cex=0.8) mtext(paste("R2=", round(summary(fit.gam)$r.sq,3),sep=""), side = 3, line=0,adj=.5, col="black",cex=0.8) pvalue <- summary(fit.gam)$s.pv if(pvalue >=0.001) {mtext(paste("p=", round(pvalue,3), sep=""),side = 3, line =0, adj = 1, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 1, col="black",cex=0.8)}
The plot from the test data set clearly shows that the relationship is not linear, levelling off at both low and high TP concentrations (Figure A10). It is possible to estimate the linear range by eye, probably 10 – 100 µgTPl-1 but an alternative approach is to use segmented regression. Note that it is much easier to appreciate that these data are not linear than when using the Excel tool kit (Figure A2).
To fit a segmented regression, it is necessary to estimate the break points. In the test data set curvature is not that marked, thus it is best to start with a single break point. Other data sets may clearly require 2 break points and to accommodate these two different functions have been provided.
First it is necessary to set up two user defined functions Myseg (single break point) and Myseg2 (2 break point). Run the following lines of script to load the user functions into R
Myseg<-function(x2,y,EstBk,EstBk2) { Eb<-log10(EstBk) Eb2<-log10(EstBk) x<-log10(x2) print(mod<-lm(y ~ x)) print(o<-segmented(mod,seg.Z=~x,psi=list(x = c(Eb))))#--fit segmented model print(summary(o)) print(bkpt<-10^(o$psi[1,2])) # Have to include the plotting code in the function plot(y~x2,log="x",ylim=c(ymin,ymax),xlim=c(xmin,xmax),xlab=xlb,ylab=ylb,cex=1) points(y.u~x2.u,col=rainbow(zN)[z.u],pch=19) legend("topright",levels(z.u),col=rainbow(zN),pch=19,cex=0.8) lines(10^(new.x.u$x.u),(pred.gam$fit), col="black") mtext("GAM", side = 3, adj=0,line =0, col="black",cex=0.8) mtext(paste("R2=", round(summary(fit.gam)$r.sq,3),sep=""), side = 3, line=0,adj=.5, col="black",cex=0.8) pvalue <- summary(fit.gam)$s.pv if(pvalue >=0.001) {mtext(paste("p=", round(pvalue,3), sep=""),side = 3, line =0, adj = 1, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 1, col="black",cex=0.8)} segments(10^(o$psi[1,2]-o$psi[1,3]),ymin,10^(o$psi[1,2]+o$psi[1,3]),ymin) points(10^(o$psi[1,2]),ymin,pch=3,cex=1.5) text(10^(o$psi[1,2]),ymin,round(10^(o$psi[1,2])),pos=3) print(10^(o$psi[1,2]-o$psi[1,3])) # lower estimate of break point print(10^(o$psi[1,2]+o$psi[1,3])) (int<-coef(o)[1]) (slope<-coef(o)[2]) (slope2<-coef(o)[2] + coef(o)[3]) (int2<- int + slope * o$psi[1,2] - slope2 * o$psi[1,2] ) abline(int,slope, col="blue") #abline(int2,slope2,col="green") }
Myseg2<-function(x2,y,EstBk,EstBk2) { Eb<-log10(EstBk) Eb2<-log10(EstBk2) x<-log10(x2) print(mod<-lm(y ~ x)) print(o<-segmented(mod,seg.Z=~x,psi=list(x = c(Eb,Eb2))))#- segmented model print(summary(o)) #First break point print(bkpt<-10^(o$psi[1,2])) # Have to include the plotting code in the function plot(y~x2,log="x",ylim=c(ymin,ymax),xlim=c(xmin,xmax),xlab=xlb,ylab=ylb,cex=1) points(y.u~x2.u,col=rainbow(zN)[z.u],pch=19) legend("topright",levels(z.u),col=rainbow(zN),pch=19,cex=0.8) lines(10^(new.x.u$x.u),(pred.gam$fit), col="black") mtext("GAM", side = 3, adj=0,line =0, col="black",cex=0.8) mtext(paste("R2=", round(summary(fit.gam)$r.sq,3),sep=""), side = 3, line=0,adj=.5, col="black",cex=0.8) pvalue <- summary(fit.gam)$s.pv if(pvalue >=0.001) {mtext(paste("p=", round(pvalue,3), sep=""),side = 3, line =0, adj = 1, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 1, col="black",cex=0.8)} segments(10^(o$psi[1,2]-o$psi[1,3]),ymin,10^(o$psi[1,2]+o$psi[1,3]),ymin) points(10^(o$psi[1,2]),ymin,pch=3,cex=1.5) text(10^(o$psi[1,2]),ymin,round(10^(o$psi[1,2])),pos=3) print(10^(o$psi[1,2]-o$psi[1,3])) # lower estimate of break point print(10^(o$psi[1,2]+o$psi[1,3])) #Second break point print(bkpt2<-10^(o$psi[2,2])) segments(10^(o$psi[2,2]-o$psi[2,3]),ymin,10^(o$psi[2,2]+o$psi[2,3]),ymin) points(10^(o$psi[2,2]),ymin,pch=3,cex=1.5) text(10^(o$psi[2,2]),ymin,round(10^(o$psi[2,2])),pos=3) print(10^(o$psi[2,2]-o$psi[2,3])) # lower estimate of break point print(10^(o$psi[2,2]+o$psi[2,3])) #Calculate slopes and intercepts for each segment #Seg1 (int<-coef(o)[1]) (slope<-coef(o)[2]) abline(int,slope, col="green") #Seg2 (slope2<-coef(o)[2] + coef(o)[3]) (int2<- int + slope * o$psi[1,2] - slope2 * o$psi[1,2] ) abline(int2,slope2,col="blue") #Seg3 (slope3<-coef(o)[3] + coef(o)[4]) (int3<- int2 + slope2 * o$psi[2,2] - slope3 * o$psi[2,2] ) abline(int3,slope3,col="red") }
For a single break point edit the first line of script to enter the approximate break point, then run the Myseg function line.
EstBk<-75 # edit this line with estimated break point Myseg(x2.u,y.u,EstBk)
For two break points edit the next two lines of script to enter the approximate break points, then run the Myseg2 function line.
EstBk<-25 # edit this line with estimated lower break point EstBk2<-100 # edit this line with estimated upper break point Myseg2(x2.u,y.u,EstBk,EstBk2)
These scripts add best fit lines to the plot and mark the estimated break points, together with their upper and lower confidence limits, which are printed on the screen as the last lines of output. e.g. for the 1 break point method the screen shows the list below (Figure A11b). break point (137), followed by its lower (102) and upper (183) confidence limits
With the example data set fitting two break points produces rather unstable predictions, with the lower break varying from 9 -30 µgl-1, an indication that curvature at the lower end of the relationship is weak and that the uncertainty is high.
par(mfrow = c(1,2)) EstBk<-25 # edit this line with estimated lower break point EstBk2<-100 # edit this line with estimated upper break point Myseg2(x2.u,y.u,EstBk,EstBk2) EstBk<-75 # edit this line with estimated break point Myseg(x2.u,y.u,EstBk) par(mfrow = c(1,1))
From these outputs, it is possible to determine what range of data can be used to fit a sensible linear relationship. In this example the curvature is not that great and the estimated break points (30 µgl-1 & 125 µgl-1) would remove many of the data points, thus a single upper limit is used of 138 µgl-1 based on the single break point model. The blue line shows the resulting linear relationship, which tracks the linear portion of the GAM model very closely. These break points also provide sufficient data above and below the high/good and good/moderate boundary EQR values (0.8 , 0.6) and thus meet the requirements for predicting the TP values at the boundaries. They are also nutrient concentrations that fit with ecological expectations (Phillips et al., 2008). It is important to stress that selecting the segment of data to be used will have a significant influence on the slope of the linear models and thus the predicted boundary values and is to an extent a subjective decision, albeit guided by the gam and segmented regression models. Using the Excel tool different segments of data can quickly be selected and the results compared.
In the case of data from a single water body type the objective is to fit a simple linear regression. As explained in section in section 2.4 there is uncertainty regarding the true slope of the relationship when the independent variable (nutrient concentration) is likely to contain significant uncertainty. Thus, the tool kit fits 3 models and estimates the boundary values from each model
It should be noted that the script for fitting these models is relatively complex and it is not recommended that users unfamiliar with R should attempt to edit the script and thus the default variables in the example data file should be used wherever possible.
A second more complex script is provided which deals with data that includes different water body types (TKit_fit_lin_mod2.R). This is useful where data for a single type do not have a sufficient range of nutrient concentrations to fit a robust model, but it can be demonstrated that the types have similar slopes. In this case RMA regression cannot be used and an alternative approach where the slopes of the two OLS regressions are averaged to provide an geometric mean regression which achieves a similar result.
Example script TKit_fit_lin_mod1.R and the data file DataTemplate_Example1.csv (data are from a single lake type) For this example, we will continue with TP data, for this analysis we need to load the lmodel2 library. As above we start by clearing the workspace and loading the data
library(lmodel2)# Type 2 regression rm(list= ls()) # Clear data
Enter file name and path
FName <- system.file("extdata/DataTemplate_Example1.csv", package = "ecostattoolkit")
Get data and check
data <- read.csv(file = FName, header = TRUE) dim(data) summary(data)
Edit the following line if TP was to be used
#Identify the records with both EQR and P data cc<-complete.cases(data$EQR,data$P) data.cc<-data[cc,] dim(data.cc)
# Data used for modelling, excluding outliers**
Assign all the data to x and y variables, including outliers, (so that the outliers can be plotted for information, they are excluded for model fitting within the script)
x<-log10(data.cc$P) x2<-10^x y<-data.cc$EQR
Identify the data to be used for modelling, edit the following lines to specify the linear section of the data. In this example the TP data is linear below 138 µgl-1. However, to allow for situations where 2 breaks are needed both a minimum and maximum value are required, with a lower value in this example of 1 µgl-1.
nut.min<-1 # min nutrient value used for model nut.max<-138 # max nutrient value used for model
To calculate boundary values EQR boundaries are entered. Edit if normalised boundaries of 0.80 and 0.60 are not appropriate
HG<-0.80 GM<-0.60
Assign the data, edit the next line if TP or other variable names are used
x.u<-log10(data.cc$P[data.cc$P<=nut.max & data.cc$P>=nut.min & data.cc$Exclude_P==FALSE]) y.u<-data.cc$EQR[data.cc$P<=nut.max & data.cc$P>=nut.min & data.cc$Exclude_P==FALSE] x2.u<-10^x.u #back transformed x for plotting on linear scale
The following 2 lines allow a check that the number of records for x and y are the same, a useful check if variable names have had to be changed
length(x.u) length(y.u)
Fit OLS model 1 (EQR v Nutrient) and assign intercept to int1 and slope to slope1
summary(mod1<-lm(y.u~x.u))#Fit model (int1<-summary(mod1)[[4]][[1]]) (slope1<-summary(mod1)[[4]][[2]])
Predict values from model 1 and calculate the upper and lower quantiles of the residuals to use as estimates of the possible range of boundary values
pred.y1<-data.frame(predict(mod1),resid(mod1)) #Calculate predicted values and residuals (upresid.1<-quantile(resid(mod1),0.75,na.rm=T)) #All residuals (lwresid.1<-quantile(resid(mod1),0.25,na.rm=T))
Calculate the good/moderate and high/good boundary values using model 1 parameters (GM.mod1, HG.mod1) and estimate the range of boundary values using the upper and lower quantiles of the residuals (GMU.mod1, GML.mod1, HGU.mod1, HGL.mod1)
(GM.mod1<-round(10^((GM-int1)/slope1),0)) # Predicted GM boundary value (GMU.mod1<-round(10^((GM-int1-upresid.1)/slope1),0)) (GML.mod1<-round(10^((GM-int1-lwresid.1)/slope1),0)) (HG.mod1<-round(10^((HG-int1)/slope1),0)) # Predicted HG boundary value (HGU.mod1<-round(10^((HG-int1-upresid.1)/slope1),0)) (HGL.mod1<-round(10^((HG-int1-lwresid.1)/slope1),0))
Fit OLS model 2 (Nutrient v EQR) and assign intercept to int2 and slope to slope2
summary(mod2<-lm(x.u~y.u)) # Fit model (int2<-summary(mod2)[[4]][[1]]) (slope2<-summary(mod2)[[4]][[2]])
To compare model 2 with model 1 we need to invert the model so that
x = slope2 * y y = 1/slope2 *x
thus the inverted model has a slope of 1/slope2.
As we know that the definition of the best fit line passes through the point (x̅, y̅) then the intercept value for the inverted model 2 can be calculated as y̅ - x̅/slope2. This is the value of y when x = 0 which the script assigns to variable y0.
y0<-mean(y.u)- (mean(x.u)/slope2)
Set up a new data.frame with observed data x.u, y.u and calculated y0 from inverted model 2 and add the calculated predicted y values and the model residuals. Calculate the upper and lower quantiles of the residuals to use as estimates of the possible range of boundary values
pred.y2<-data.frame(x.u,y.u,y0) # set up data.frame pred.y2<-within(pred.y2,pred.y2<-x.u*1/slope2+y0) # calc predicted values pred.y2<-within(pred.y2,resid.2<-y.u-pred.y2) # calc residuals resid.2 upresid.2<-quantile(pred.y2$resid.2,0.75,na.rm=T) lwresid.2<-quantile(pred.y2$resid.2,0.25,na.rm=T)
Calculate the good/moderate and high/good boundary values using inverted model 2 parameters (GM.mod2, HG.mod2) and estimate the range of boundary values using the upper and lower quantiles of the residuals (GMU.mod2, GML.mod2, HGU.mod2, HGL.mod2)
(GM.mod2<-round(10^((GM-y0)*slope2),0)) # Predicted GM boundary value (GMU.mod2<-round(10^((GM-y0-upresid.2)*slope2),0)) (GML.mod2<-round(10^((GM-y0-lwresid.2)*slope2),0)) (HG.mod2<-round(10^((HG-y0)*slope2),0)) # Predicted HG boundary factor (HA) (HGU.mod2<-round(10^((HG-y0-upresid.2)*slope2),0)) (HGL.mod2<-round(10^((HG-y0-lwresid.2)*slope2),0))
Fit model 4 (EQR v Nutrient) using type II Ranged Major Axis (RMA) regression and assign intercept to RMA.int and slope to RMA.slope. (note model 3 in the Tool Kit is a geometric mean regression calculated by taking the geometric average of the slopes of model 1 and inverted model 2. It is not needed for the simple case of a relationship without a factor).
To fit a RMA regression we use R package lmodel2 which performs a variety of regression methods including RMA
lmod<-lmodel2(y.u~x.u,range.y="relative",range.x="interval",nperm=99) lmod # provides a summary of all models including conventional OLS (RMA.int<-lmod$regression.results[[2]][[4]]) (RMA.slope<-lmod$regression.results[[3]][[4]])
Predict values from model 4 and calculate the upper and lower quantiles of the residuals to use as estimates of the possible range of boundary values
pred.yRMA<-x.u*RMA.slope+RMA.int (upresid.RMA<-quantile(resid.RMA<-y.u-pred.yRMA,0.75,na.rm=T)) (lwresid.RMA<-quantile(resid.RMA<-y.u-pred.yRMA,0.25,na.rm=T))
Calculate the good/moderate and high/good boundary values using model 4 parameters (GM.mod4, HG.mod4) and estimate the range of boundary values using the upper and lower quantiles of the residuals (GMU.mod4, GML.mod4, HGU.mod4, HGL.mod4)
(GM.mod4<-round(10^((GM-RMA.int)/RMA.slope),0))#Predicted GM boundary value (GML.mod4<-round(10^((GM-RMA.int-lwresid.RMA)/RMA.slope),0)) (GMU.mod4<-round(10^((GM-RMA.int-upresid.RMA)/RMA.slope),0)) (HG.mod4<-round(10^((HG-RMA.int)/RMA.slope),0))#Predicted HG boundary value (HGL.mod4<-round(10^((HG-RMA.int-lwresid.RMA)/RMA.slope),0)) (HGU.mod4<-round(10^((HG-RMA.int-upresid.RMA)/RMA.slope),0))
This section will produce 3 plots comparing the three models and the predicted good/moderate nutrient boundary values. The final section places a copy of the boundary values into the clipboard to allow values to be pasted into Excel to create a summary table
First set up the axis ranges and provide axis labels. Edit the values in the lines below to match your data.
x<-log10(data.cc$P) x2<-10^x y<-data.cc$EQR xmin<- 5 xmax<- 500 ymin<- 0 ymax<- 1.5 xlb<-expression(paste("total phosphorus ","(µg ", L^-1,")")) ylb<-"EQR" # Y axis label
Create a new graphics window and set up 3 panes with appropriate margins
#win.graph(width=14,height=4.5)# new graphic window par(mfrow=c(1,3)) par(mar=c(4,5,3.5,1))
Set up title for Fig A and produce plot.
title<-"Regression of EQR on P"
First plot all the data as open circles
plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb, xlab=xlb,las=1)
Mark the data points used for models as solid circles
points(y.u~x2.u,pch=19)
Add cross to mark data mean point ( x̅ ,y̅)
points((10^mean(x.u)),mean(y.u),pch=3,cex=3) # plot mean of x and y
Extract regression statistics and add to plot. R2, significance, slope and standard error of slope,
mtext(paste("R2=", round(summary(mod1)$r.sq,3),sep=""), side = 3, line=0,adj=0,col="black",cex=0.8) pvalue<-pf(summary(mod1)[[10]][[1]],summary(mod1)[[10]][[2]], summary(mod1)[[10]][[3]],lower.tail=F) if(pvalue >=0.001) {mtext(paste("p=", round(pvalue,3), sep=""),side = 3, line =0, adj = 0.25, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 0.25, col="black",cex=0.8)} mtext(paste("slope= ", round(summary(mod1)[[4]][[2]],3)," se ", round(summary(mod1)[[4]][[4]],3),sep=""),side = 3,line=0,adj=0.75, col="black",cex=0.8)
Set up data.frame for fitted line and add to plot with upper and lower quartiles of residuals
new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x))) # new values of X on log scale lines(10^(new.x$x),new.x$x*slope1+int1) lines(10^(new.x$x),new.x$x*slope1+int1+upresid.1,lty=2) lines(10^(new.x$x),new.x$x*slope1+int1+lwresid.1,lty=2) abline("h"=GM,lty=1) abline("v"=GM.mod1,lty=1) abline("v"=GMU.mod1,lty=3) abline("v"=GML.mod1,lty=3) text(xmin,GM+0.03,GM,pos=4) text(GM.mod1,ymin,GM.mod1,pos=3) text(GMU.mod1,ymin,GMU.mod1,pos=4) text(GML.mod1,ymin,GML.mod1,pos=2)
Set up title for Fig B and produce plot.
title<-"Regression of P on EQR" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb,xlab=xlb, las=1) points(y.u~x2.u,pch=19) points((10^mean(x.u)),mean(y.u),pch=3,cex=3) # plot mean of x and y on linear scale mtext(paste("R2=", round(summary(mod2)$r.sq,3),sep=""), side = 3, line=0,adj=0, col="black",cex=0.8) pvalue<-pf(summary(mod2)[[10]][[1]],summary(mod2)[[10]][[2]], summary(mod2)[[10]][[3]],lower.tail=F) if(pvalue >=0.001) {mtext(paste("p=", pvalue, sep=""),side = 3, line =0, adj = 0.25, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 0.25, col="black",cex=0.8)} mtext(paste("slope= ", round(1/summary(mod2)[[4]][[2]],3)," se ", round(summary(mod2)[[4]][[4]],3),sep=""), side = 3, line=0,adj=0.75, col="black",cex=0.8) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x)))#new values of X on log scale lines(10^(new.x$x),new.x$x*1/slope2+y0) lines(10^(new.x$x),new.x$x*1/slope2+y0+upresid.2,lty=2) lines(10^(new.x$x),new.x$x*1/slope2+y0+lwresid.2,lty=2) abline("h"=GM,lty=1) abline("v"=GM.mod2,lty=1) abline("v"=GMU.mod2,lty=3) abline("v"=GML.mod2,lty=3) text(xmin,GM+0.03,GM,pos=4) text(GM.mod2,ymin,GM.mod2,pos=3) text(GMU.mod2,ymin,GMU.mod2,pos=4) text(GML.mod2,ymin,GML.mod2,pos=2)
Set up title for Fig C and produce plot.
title<-"Ranged major axis regression EQR v N" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb, xlab=xlb,las=1) points(y.u~x2.u,pch=19) points(10^(mean(x.u)),mean(y.u),pch=3,cex=3) mtext(paste("RMA slope=", round(RMA.slope,3),sep=""), side = 3, line=0,adj=0.6, col="black",cex=0.8) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = 500)) lines(10^(new.x$x),new.x$x*RMA.slope+RMA.int) lines(10^(new.x$x),new.x$x*RMA.slope+RMA.int+upresid.RMA,lty=2) lines(10^(new.x$x),new.x$x*RMA.slope+RMA.int+lwresid.RMA,lty=2) abline("h"=GM,lty=1) abline("v"=GM.mod4,lty=1) abline("v"=GMU.mod4,lty=3) abline("v"=GML.mod4,lty=3) text(xmin,GM+0.03,GM,pos=4) text(GM.mod4,ymin,GM.mod4,pos=3) text(GMU.mod4,ymin,GMU.mod4,pos=4) text(GML.mod4,ymin,GML.mod4,pos=2)
par(mfrow=c(2,2)) par(mar=c(4,5,3.5,1)) plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb, xlab=xlb,las=1) points(y.u~x2.u,pch=19) points((10^mean(x.u)),mean(y.u),pch=3,cex=3) # plot mean of x and y mtext(paste("R2=", round(summary(mod1)$r.sq,3),sep=""), side = 3, line=0,adj=0,col="black",cex=0.8) pvalue<-pf(summary(mod1)[[10]][[1]],summary(mod1)[[10]][[2]], summary(mod1)[[10]][[3]],lower.tail=F) if(pvalue >=0.001) {mtext(paste("p=", round(pvalue,3), sep=""),side = 3, line =0, adj = 0.25, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 0.25, col="black",cex=0.8)} mtext(paste("slope= ", round(summary(mod1)[[4]][[2]],3)," se ", round(summary(mod1)[[4]][[4]],3),sep=""),side = 3,line=0,adj=0.75, col="black",cex=0.8) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x))) # new values of X on log scale lines(10^(new.x$x),new.x$x*slope1+int1) lines(10^(new.x$x),new.x$x*slope1+int1+upresid.1,lty=2) lines(10^(new.x$x),new.x$x*slope1+int1+lwresid.1,lty=2) abline("h"=GM,lty=1) abline("v"=GM.mod1,lty=1) abline("v"=GMU.mod1,lty=3) abline("v"=GML.mod1,lty=3) text(xmin,GM+0.03,GM,pos=4) text(GM.mod1,ymin,GM.mod1,pos=3) text(GMU.mod1,ymin,GMU.mod1,pos=4) text(GML.mod1,ymin,GML.mod1,pos=2) title<-"Regression of P on EQR" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb,xlab=xlb, las=1) points(y.u~x2.u,pch=19) points((10^mean(x.u)),mean(y.u),pch=3,cex=3) # plot mean of x and y on linear scale mtext(paste("R2=", round(summary(mod2)$r.sq,3),sep=""), side = 3, line=0,adj=0, col="black",cex=0.8) pvalue<-pf(summary(mod2)[[10]][[1]],summary(mod2)[[10]][[2]], summary(mod2)[[10]][[3]],lower.tail=F) if(pvalue >=0.001) {mtext(paste("p=", pvalue, sep=""),side = 3, line =0, adj = 0.25, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 0.25, col="black",cex=0.8)} mtext(paste("slope= ", round(1/summary(mod2)[[4]][[2]],3)," se ", round(summary(mod2)[[4]][[4]],3),sep=""), side = 3, line=0,adj=0.75, col="black",cex=0.8) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x)))#new values of X on log scale lines(10^(new.x$x),new.x$x*1/slope2+y0) lines(10^(new.x$x),new.x$x*1/slope2+y0+upresid.2,lty=2) lines(10^(new.x$x),new.x$x*1/slope2+y0+lwresid.2,lty=2) abline("h"=GM,lty=1) abline("v"=GM.mod2,lty=1) abline("v"=GMU.mod2,lty=3) abline("v"=GML.mod2,lty=3) text(xmin,GM+0.03,GM,pos=4) text(GM.mod2,ymin,GM.mod2,pos=3) text(GMU.mod2,ymin,GMU.mod2,pos=4) text(GML.mod2,ymin,GML.mod2,pos=2) #Set up title for Fig C and produce plot. title<-"Ranged major axis regression EQR v N" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb, xlab=xlb,las=1) points(y.u~x2.u,pch=19) points(10^(mean(x.u)),mean(y.u),pch=3,cex=3) mtext(paste("RMA slope=", round(RMA.slope,3),sep=""), side = 3, line=0,adj=0.6, col="black",cex=0.8) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = 500)) lines(10^(new.x$x),new.x$x*RMA.slope+RMA.int) lines(10^(new.x$x),new.x$x*RMA.slope+RMA.int+upresid.RMA,lty=2) lines(10^(new.x$x),new.x$x*RMA.slope+RMA.int+lwresid.RMA,lty=2) abline("h"=GM,lty=1) abline("v"=GM.mod4,lty=1) abline("v"=GMU.mod4,lty=3) abline("v"=GML.mod4,lty=3) text(xmin,GM+0.03,GM,pos=4) text(GM.mod4,ymin,GM.mod4,pos=3) text(GMU.mod4,ymin,GMU.mod4,pos=4) text(GML.mod4,ymin,GML.mod4,pos=2)
Collect information about the models together with the predicted boundary values into a data.frame called out1.
(out1<-data.frame( R2=c(round(summary(mod1)[[9]],3),"",""), N=c(length(x.u),"",""), slope=c(slope1,RMA.slope,1/slope2), int=c(int1,RMA.int,y0), GM =c(GM.mod1,GM.mod4,GM.mod2), GML=c(GML.mod1,GML.mod4,GML.mod2), GMU=c(GMU.mod1,GMU.mod4,GMU.mod2), HG =c(HG.mod1,HG.mod4,HG.mod2), HGL=c(HGL.mod1,HGL.mod4,HGL.mod2), HGU=c(HGU.mod1,HGU.mod4,HGU.mod2)))
Data.frame "out1" contains the following values as columns; R2 the number of observations used N, slope, intercept, predicted good/moderate boundary GM, lower and upper estimates of good/moderate boundaries GML GMU, the high/good boundary HG, lower and upper estimates of high/good boundaries HGL HGU. Rows are results for Model 1, Model 3 and Model 2.
Write this to clipboard for pasting into Excel
After formatting and adding titles in Excel a table can be produced (Table A 4)
write.table(out1,"clipboard",sep="\t",row.names=F,col.names=F)
knitr::kable(out1, caption = "Table A 4 Summary output from models showing predicted boundary values and ranges derived from the upper and lower quartiles of model residuals")
The plots that the script produces are shown as Figure A12
For the regression approach it is important that the data cover a sufficiently wide gradient. For example, in parts of Europe there may be few water bodies in good or high status, conversely for some water body types there may be few impacted examples.
An example from a data set of low alkalinity lakes is shown in Figure A13a, only one lake is worse than good status (EQR <0.60). A GAM model clearly indicates that EQR values level off where TP is <7 µgl-1 and a linear model fitted to data above this value needs to be extrapolated to determine the TP value at the good/moderate boundary. Additionally, the single lake in poor status is likely to be very influential. Fitting a type II model to these data would produce a steeper slope and as the mean of the data lies within high status, some distance from the good/moderate boundary the change in slope would significantly lower the good/moderate TP boundary. This illustrates that it is undesirable to use models where extrapolation is required when setting boundary values.
The best solution would be to obtain data from more impacted lakes of the same lake type, perhaps from a neighboring country, although this introduces additional problems of ensuring that the EQR metric is on a comparable scale. An alternative solution is to consider fitting a more complex model which includes other lake types. A visual inspection of the scatter plot (Figure A13b), suggests that other lake types have a similar slope to the low alkalinity lakes but have different intercepts. Thus, a model of the form EQR = a Log10[TP] + c + Type, might be appropriate, from which it is possible to calculate boundary values using a common slope and type specific intercepts.
[REQUIRE CODE TO GENERATE THIS FIGURE]
Figure A 13 Relationship between phytoplankton EQR and mean annual TP for a) low alkalinity clear water lakes and b) low (LA), moderate (MA), high (HA) alkalinity clear water and organic lakes. Red lines show gam models, green line linear OLS fit for low alkalinity lakes in range of TP 7-100 µgl-1, black line linear fit to all lake types. Cross marks the mean of EQR and TP for the low alkalinity lakes.
As this may be a relatively common problem an R script to do this is included in the toolkit (TKit_fit_lin_mod2.R). It is stressed that the approach used is not statistically perfect and other better solutions may be available so it is suggested that a statistician should be consulted. However, the toolkit provides a potential solution.
Use data in DataTemplate_Example2.csv and script TKit_fit_mod2. In this script, we fit a model that includes a categorical grouping variable, such as lake type that allows a model to be fitted that has the same slope but different intercepts to be calculated for each group (lake type).
The script follows a similar pattern to the previous example and is thus is not described in such detail. It is assumed that prior to running the script outliers have been identified and the linear portion of the data determined. As previously, three regression models are fitted, an OLS of EQR v TP, an OLS of TP v EQR and a geometric mean regression derived from the geometric average of the slopes from the two OLS models (as is used in the Excel toolkit).
The first section of script loads the data, selects records where both EQR and TP data are available and allocates the variables x,y to the data (Log10 TP & EQR). In addition, the variable z is used to hold the categorical grouping variable (Group), the as.factor function is used to ensure that numeric codes are treated as categories. The linear part of the data used for modelling are allocated to variables x.u,y.u and z.u, using the variables nut.min and nut.max to define the appropriate range.
Prepare the R Session.
rm(list= ls()) # Clear data #Enter file name and path FName <- system.file("extdata/DataTemplate_Example2.csv", package = "ecostattoolkit") #Get data and check data <- read.csv(file = FName, header = TRUE) dim(data) summary(data) #Identify the records with both EQR and TP data to produce a data set without missing values cc<-complete.cases(data$EQR,data$P) data.cc<-data[cc,] dim(data.cc)
Get all data for plotting
x<-log10(data.cc$P) x2<-10^x y<-data.cc$EQR z<-as.factor(data.cc$Group) zN<-length(levels(z)) # Enter the high/good and good/moderate boundary values HG<-0.80 GM<-0.60 # Assign data used for models to x.u and y.u nut.min<-7 #min nutrient value used for model nut.max<-100 #max nutrient value used for model x.u<-log10(data.cc$P[data.cc$P<=nut.max & data.cc$P>=nut.min & data.cc$Exclude_P==FALSE]) y.u<-data.cc$EQR[data.cc$P<=nut.max & data.cc$P>=nut.min & data.cc$Exclude_P==FALSE] x2.u<-10^x.u #back transformed x for plotting on linear scale z.u<-as.factor(data.cc$Group[data.cc$P<=nut.max & data.cc$P>=nut.min & data.cc$Exclude_P==FALSE]) zN.u<-length(levels(z.u)) # Check the data records are the same length length(x.u) length(y.u) length(z.u)
For convenience, the next section produces a plot which can be used to check and visualise the data (Figure A14).
MetricUsed<-"Phytoplankton" # labels for axis ylb<-"EQR" # Y axis label xlb<-expression(paste("total phosphorus ","(µg ", L^-1,")")) plot(y~x2,main=MetricUsed,log="x",ylab=ylb,xlab=xlb, las=1) points(y.u~x2.u,pch=19,col=rainbow(zN.u)[z.u]) legend("bottomleft",levels(z.u),col=rainbow(zN.u),pch=19,cex=0.8)
The next section is important as it checks whether the proposed modelling approach is appropriate. Three models are compared:
The models are compared using the Akaike Information Criterion (AIC), a measure of the relative quality of statistical models. The model with the lowest AIC being the most appropriate model to use.
Compare models including grouping variable, with and without interaction. With the example data the following output is produced.
summary(mod1<-lm(y.u~x.u)) #base model, no group variable summary(mod1a<-lm(y.u~x.u+z.u))#model with group variable, different intercepts summary(mod1b<-lm(y.u~x.u*z.u))#model with interaction term, different slopes & intercept AIC(mod1,mod1a,mod1b) #compare models, select model with lowest AIC
It shows that the models including type are substantially better than the simple model (mod1). There is very little difference between the models with and without interaction terms, so it is appropriate to continue using the script which fits models with a common slope (mod1a).
If the model with interaction (mod1b) had a significantly lower AIC, then this indicates that different slopes are needed and analysis of a combined data set in this way would not be appropriate.
Stop at this point if mod1b has the lowest AIC
The next 3 sections fit the two OLS models and then calculate the slope and intercepts for the 3rd model by taking the geometric average of model 1 and the inverted model 2. For each model different intercepts are calculated for each water body type. The script has been designed to work with different numbers of categories, but it is probably only appropriate to use up to 5 different categories. Examination of the model summaries will provide indications as to which types are most different and suggest ways of combining categories.
As in the previous simple example the upper and lower quartiles of the residual are calculated and used to estimate uncertainty of the predicted boundary values.
Model 1a OLS Y on X + Z
summary(mod1a<-lm(y.u~x.u+z.u)) # predict values of y from model 1 pred.y1a<-data.frame(predict(mod1a),resid(mod1a),z.u)#Calculate predicted values and residuals # Calculate upper and lower quartiles of residuals for 1st factor upresid.1a<-quantile(subset(pred.y1a$resid.mod1a,z.u==levels(z.u)[1]),0.75,na.rm=T) lwresid.1a<-quantile(subset(pred.y1a$resid.mod1a,z.u==levels(z.u)[1]),0.25,na.rm=T) # Extract the slope for model 1a slope1a<-summary(mod1a)[[4]][[2]] # Extract the intercept for 1st grouping factor of model 1a int1a<-summary(mod1a)[[4]][[1]]
The next lines of script calculate the boundary values and their upper and lower ranges. Initially the values for the 1st grouping factor are determined.
# Calculate GM and HG boundaries with upper and lower estimates for 1st factor GM.mod1a<-round(10^((GM-int1a[1])/slope1a),0) GMU.mod1a<-round(10^((GM-int1a[1]-upresid.1a[1])/slope1a),0) GML.mod1a<-round(10^((GM-int1a[1]-lwresid.1a[1])/slope1a),0) HG.mod1a<-round(10^((HG-int1a[1])/slope1a),0) HGU.mod1a<-round(10^((HG-int1a[1]-upresid.1a[1])/slope1a),0) HGL.mod1a<-round(10^((HG-int1a[1]-lwresid.1a[1])/slope1a),0)
The intercepts and boundaries for the remaining grouping factors are then determined
# Extract quartiles of residuals, intercepts, calculate boundaries for # remaining factors on model 1a for (i in 2:zN.u){ upresid.1a<-cbind(upresid.1a,quantile(subset(pred.y1a$resid.mod1a,z.u==levels(z.u)[i]),0.75,na.rm=T)) lwresid.1a<-cbind(lwresid.1a,quantile(subset(pred.y1a$resid.mod1a,z.u==levels(z.u)[i]),0.25,na.rm=T)) int1a<-cbind(int1a,(int1a[1]+summary(mod1a)[[4]][[(i+1)]])) GM.mod1a<-cbind(GM.mod1a,round(10^((GM-int1a[i])/slope1a),0)) GMU.mod1a<-cbind(GMU.mod1a,round(10^((GM-int1a[i]-upresid.1a[i])/slope1a),0)) GML.mod1a<-cbind(GML.mod1a,round(10^((GM-int1a[i]-lwresid.1a[i])/slope1a),0)) HG.mod1a<-cbind(HG.mod1a,round(10^((HG-int1a[i])/slope1a),0)) HGU.mod1a<-cbind(HGU.mod1a,round(10^((HG-int1a[i]-upresid.1a[i])/slope1a),0)) HGL.mod1a<-cbind(HGL.mod1a,round(10^((HG-int1a[i]-lwresid.1a[i])/slope1a),0)) }
The results are then placed in a data.frame (Sum.mod1a) to provide a convenient way to view the results of the 1st model and the predicted boundary values.
(Sum.mod1a<-data.frame( R2=c(round(summary(mod1a)[[9]],3)), N=c(length(x.u)), slope=c(1/slope1a), int=c(y0), GM =c(GM.mod1a), GML=c(GML.mod1a), GMU=c(GMU.mod1a), HG =c(HG.mod1a), HGL=c(HGL.mod1a), HGU=c(HGU.mod1a), row.names=levels(z.u)))
This produces the following output, GM the good/moderate boundary, GML & GMU the interquartile range for the good/moderate boundary, HG the high/good boundary and its interquartile range HGL & HGU. The R2, number of cases, slope.
Note that at the end of the script there is a different summary output that allows the different models to be compared for each water body category (type).
Similar approaches are then followed to determine the other models, although the script for model 2 is slightly more complex to allow for the model inversion needed to visualise the output. Details are not shown here, see script for details and to run the example.
The script then produces a plot comparing each of the models and showing the predicted boundary values (Figure A15). For clarity, the upper and lower ranges of the boundaries are not shown. This section of script starts by setting the axis limits and labels, which for a different example may need to be edited.
#.......................................................................................... # Model 2a OLS X on Y + Z #.......................................................................................... summary(mod2a<-lm(x.u~y.u+z.u)) # Extract the slope & intercept for 1st factor for model 2a slope2a<-summary(mod2a)[[4]][[2]] int2a<-summary(mod2a)[[4]][[1]] y0<-(0-int2a)/slope2a for (i in 2:zN.u){ int2a<-cbind(int2a,(int2a[1]+summary(mod2a)[[4]][[(i+1)]])) y0<-cbind(y0,(0-int2a[i])/slope2a) } # Set up a data frame with observed x, factor z and add appropriate intercept value y0a for category # calculate predicted y values (EQR) and calculate residuals (pred.y2a<-data.frame(x.u,y.u,z.u,y0[1])) colnames(pred.y2a)[4]<-"y0.i" for (i in 2:zN.u){ pred.y2a<-within(pred.y2a,y0.i[z.u==levels(z.u)[i]]<-y0[i]) } (pred.y2a<-within(pred.y2a,pred.y2a<-x.u*1/slope2a+y0.i)) #calc predicted values pred.y2 (pred.y2a<-within(pred.y2a,resid.2a<-y.u-pred.y2a)) #calc residuals resid.2 # calculate upper and lower 25th 75th quantiles of residuals upresid.2a<-quantile(subset(pred.y2a$resid.2a,z.u==levels(z.u)[1]),0.75,na.rm=T) lwresid.2a<-quantile(subset(pred.y2a$resid.2a,z.u==levels(z.u)[1]),0.25,na.rm=T) for (i in 2:zN.u){ upresid.2a<-cbind(upresid.2a,quantile(subset(pred.y2a$resid.2a,z.u==levels(z.u)[i]),0.75,na.rm=T)) lwresid.2a<-cbind(lwresid.2a,quantile(subset(pred.y2a$resid.2a,z.u==levels(z.u)[i]),0.25,na.rm=T)) } (GM.mod2a<-round(10^((GM-y0)*slope2a),0)) #Predicted GM boundary value for 1st factor (HA) (GMU.mod2a<-round(10^((GM-y0-upresid.2a)*slope2a),0)) (GML.mod2a<-round(10^((GM-y0-lwresid.2a)*slope2a),0)) (HG.mod2a<-round(10^((HG-y0)*slope2a),0)) #Predicted GM boundary value for 1st factor (HA) (HGU.mod2a<-round(10^((HG-y0-upresid.2a)*slope2a),0)) (HGL.mod2a<-round(10^((HG-y0-lwresid.2a)*slope2a),0)) (Sum.mod2a<-data.frame( R2=c(round(summary(mod2a)[[9]],3)), N=c(length(x.u)), slope=c(1/slope2a), int=c(y0), GM =c(GM.mod2a), GML=c(GML.mod2a), GMU=c(GMU.mod2a), HG =c(HG.mod2a), HGL=c(HGL.mod2a), HGU=c(HGU.mod2a), row.names=levels(z.u))) #.......................................................................................... # Model 3a Standard Major Axis (SMA) regression (geometric average of models 1 & 2) #.......................................................................................... # calculate the average slope and intercept of x on y and y on x (slope3a<-sign(slope1a)*sqrt(slope1a*1/slope2a))#geometric average or SMA regression int3a<-mean(y.u[z.u==levels(z.u)[1]])-(slope3a*mean(x.u[z.u==levels(z.u)[1]])) for (i in 2:zN.u){ int3a<-cbind(int3a,mean(y.u[z.u==levels(z.u)[i]])-(slope3a*mean(x.u[z.u==levels(z.u)[i]]))) } # Set up a data frame with observed x, factor z and add appropriate intercept value y0a for category # calculate predicted y values (EQR) and calculate residuals (pred.y3a<-data.frame(x.u,y.u,z.u,int3a[1])) colnames(pred.y3a)[4]<-"int3a.i" for (i in 2:zN.u){ pred.y3a<-within(pred.y3a,int3a.i[z.u==levels(z.u)[i]]<-int3a[i]) } (pred.y3a<-within(pred.y3a,pred.y3a<-x.u*slope3a+int3a.i)) #calc predicted values pred.y2 (pred.y3a<-within(pred.y3a,resid.3a<-y.u-pred.y3a)) # calculate upper and lower 25th 75th quantiles of residuals upresid.3a<-quantile(subset(pred.y3a$resid.3a,z.u==levels(z.u)[1]),0.75,na.rm=T) lwresid.3a<-quantile(subset(pred.y3a$resid.3a,z.u==levels(z.u)[1]),0.25,na.rm=T) for (i in 2:zN.u){ upresid.3a<-cbind(upresid.3a,quantile(subset(pred.y3a$resid.3a,z.u==levels(z.u)[i]),0.75,na.rm=T)) lwresid.3a<-cbind(lwresid.3a,quantile(subset(pred.y3a$resid.3a,z.u==levels(z.u)[i]),0.25,na.rm=T)) } (GM.mod3a<-round(10^((GM-int3a)/slope3a),0)) #Predicted GM boundary value for 1st factor (HA) (GMU.mod3a<-round(10^((GM-int3a-upresid.3a)/slope3a),0)) (GML.mod3a<-round(10^((GM-int3a-lwresid.3a)/slope3a),0)) (HG.mod3a<-round(10^((HG-int3a)/slope3a),0)) #Predicted GM boundary value for 1st factor (HA) (HGU.mod3a<-round(10^((HG-int3a-upresid.3a)/slope3a),0)) (HGL.mod3a<-round(10^((HG-int3a-lwresid.3a)/slope3a),0)) (Sum.mod3a<-data.frame( R2="", N=c(length(x.u)), slope=c(slope3a), int=c(int3a), GM =c(GM.mod3a), GML=c(GML.mod3a), GMU=c(GMU.mod3a), HG =c(HG.mod3a), HGL=c(HGL.mod3a), HGU=c(HGU.mod3a), row.names=levels(z.u)))
# re-set scales for plotting xmin<- 5 xmax<- 200 ymin<- 0 ymax<- 1.5 MetricUsed<-"Phytoplankton" # set up scales & labels for axis ylb<-"EQR" # Y axis label xlb<-expression(paste("total phosphorus ","(µg ", L^-1,")")) #win.graph(width=14,height=9) # new graphic window par(mfcol=c(3,2)) # delete this line if separate plots are required par(mar=c(4,5,5,1)) # Fig A title<-"Regression of EQR on TP" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb,xlab=xlb, las=1) points(y.u~x2.u,pch=19,col=rainbow(zN.u)[z.u]) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x)))#new values of X on log scale for (i in 1:zN.u){ lines(10^(new.x$x),new.x$x*slope1a+int1a[i],col=rainbow(zN.u)[i]) points(10^(mean(x.u[z.u==levels(z.u)[i]])), mean(y.u[z.u==levels(z.u)[i]]),pch=3,cex=3,col=rainbow(zN.u)[i]) } legend("bottomleft",levels(z.u),col=rainbow(zN.u),pch=19,cex=1) mtext(paste("R2=", round(summary(mod1a)$r.sq,3),sep=""), side = 3, line=0,adj=0, col="black",cex=0.8) pvalue<-pf(summary(mod1a)[[10]][[1]],summary(mod1a)[[10]][[2]],summary(mod1a)[[10]][[3]],lower.tail=F) if(pvalue >=0.001) {mtext(paste("p=", round(pvalue,3), sep=""),side = 3, line =0, adj = 0.25, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 0.25, col="black",cex=0.8)} mtext(paste("slope= ", round(slope1a,3)," se ", round(summary(mod1a)[[4]][[7]],3),sep=""),side = 3,line=0,adj=0.75, col="black",cex=0.8) abline("h"=GM,lty=2) abline("v"=GM.mod1a,lty=2,col=rainbow(zN.u)) text(xmin,GM,GM,cex=1,pos=3) text(GM.mod1a,ymin,GM.mod1a,pos=c(3,4),cex=1) # Fig B title<-"Regression of TP on EQR" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb,xlab=xlb, las=1) points(y.u~x2.u,pch=19,col=rainbow(zN.u)[z.u]) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x)))#new values of X on log scale for (i in 1:zN.u){ lines(10^(new.x$x),new.x$x*1/slope2a+y0[i],col=rainbow(zN.u)[i]) points(10^(mean(x.u[z.u==levels(z.u)[i]])), mean(y.u[z.u==levels(z.u)[i]]),pch=3,cex=3,col=rainbow(zN.u)[i]) } mtext(paste("R2=", round(summary(mod2a)$r.sq,3),sep=""), side = 3, line=0,adj=0, col="black",cex=0.8) pvalue<-pf(summary(mod2a)[[10]][[1]],summary(mod2a)[[10]][[2]],summary(mod2a)[[10]][[3]],lower.tail=F) if(pvalue >=0.001) {mtext(paste("p=", pvalue, sep=""),side = 3, line =0, adj = 0.25, col = "black",cex=0.8)} if(pvalue <0.001) {mtext("p<0.001", side = 3, line= 0, adj = 0.25, col="black",cex=0.8)} mtext(paste("slope= ", round(1/slope2a,3)," se ", round(summary(mod2a)[[4]][[7]],3),sep=""), side = 3, line=0,adj=0.75, col="black",cex=0.8) abline("h"=GM,lty=2) abline("v"=GM.mod2a,lty=2,col=rainbow(zN.u)) text(xmin,GM,GM,cex=1,pos=3) text(GM.mod2a,ymin,GM.mod2a,pos=c(3,4),cex=1) # Fig C title<-"Geometric mean (SMA) regression\nEQR v TP" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb, xlab=xlb,las=1) points(y.u~x2.u,pch=19,col=rainbow(zN.u)[z.u]) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x)))#new values of X on log scale for (i in 1:zN.u){ lines(10^(new.x$x),new.x$x*slope3a+int3a[i],col=rainbow(zN.u)[i]) points(10^(mean(x.u[z.u==levels(z.u)[i]])), mean(y.u[z.u==levels(z.u)[i]]),pch=3,cex=3,col=rainbow(zN.u)[i]) } mtext(paste("slope= ", round(slope3a,3),sep=""), side = 3, line=0,adj=0.5, col="black",cex=0.8) abline("h"=GM,lty=2) abline("v"=GM.mod3a,lty=2,col=rainbow(zN.u)) text(xmin,GM,GM,cex=1,pos=3) text(GM.mod3a,ymin,GM.mod3a,pos=c(3,4),cex=1) # Fig D title<-"" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb,xlab=xlb, las=1) points(y.u~x2.u,pch=19,col=rainbow(zN.u)[z.u]) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x)))#new values of X on log scale for (i in 1:zN.u){ lines(10^(new.x$x),new.x$x*slope1a+int1a[i],col=rainbow(zN.u)[i]) points(10^(mean(x.u[z.u==levels(z.u)[i]])), mean(y.u[z.u==levels(z.u)[i]]),pch=3,cex=3,col=rainbow(zN.u)[i]) } legend("bottomleft",levels(z.u),col=rainbow(zN.u),pch=19,cex=1) abline("h"=HG,lty=2) abline("v"=HG.mod1a,lty=2,col=rainbow(zN.u)) text(xmin,HG,HG,cex=1,pos=3) text(HG.mod1a,ymin,HG.mod1a,pos=c(3,4),cex=1) # Fig E title<-"" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb,xlab=xlb, las=1) points(y.u~x2.u,pch=19,col=rainbow(zN.u)[z.u]) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x)))#new values of X on log scale for (i in 1:zN.u){ lines(10^(new.x$x),new.x$x*1/slope2a+y0[i],col=rainbow(zN.u)[i]) points(10^(mean(x.u[z.u==levels(z.u)[i]])), mean(y.u[z.u==levels(z.u)[i]]),pch=3,cex=3,col=rainbow(zN.u)[i]) } abline("h"=HG,lty=2) abline("v"=HG.mod2a,lty=2,col=rainbow(zN.u)) text(xmin,HG,HG,cex=1,pos=3) text(HG.mod2a,ymin,HG.mod2a,pos=c(3,4),cex=1) # Fig F title<-"" plot(y~x2,ylim=c(ymin,ymax),xlim=c(xmin,xmax),main=title,log="x",ylab=ylb, xlab=xlb,las=1) points(y.u~x2.u,pch=19,col=rainbow(zN.u)[z.u]) new.x <- data.frame(x = seq(min(x, na.rm=T), max(x, na.rm=T), length = length(x)))#new values of X on log scale for (i in 1:zN.u){ lines(10^(new.x$x),new.x$x*slope3a+int3a[i],col=rainbow(zN.u)[i]) points(10^(mean(x.u[z.u==levels(z.u)[i]])), mean(y.u[z.u==levels(z.u)[i]]),pch=3,cex=3,col=rainbow(zN.u)[i]) } abline("h"=HG,lty=2) abline("v"=HG.mod3a,lty=2,col=rainbow(zN.u)) text(xmin,HG,HG,cex=1,pos=3) text(HG.mod3a,ymin,HG.mod3a,pos=c(3,4),cex=1)
Following the plot, the model summary information and boundary values are collected and placed in a series of data frames called Out1, Out 2 etc, each containing the summary for a single water body type. The final line of extract below places the output into the clipboard enabling a paste to Excel etc
for(i in 1:zN.u){ out<-data.frame( Type=c(levels(z.u)[i],"",""), R2=c(round(summary(mod1a)[[9]],3),"",round(summary(mod2a)[[9]],3)), N=c(length(x.u),"",""), slope=c(round(slope1a,3),round(slope3a,3),round(1/slope2a,3)), int=c(round(int1a[i],3),round(int3a[i],3),round(y0[i],3)), GM =c(GM.mod1a[i],GM.mod3a[i],GM.mod2a[i]), GML=c(GML.mod1a[i],GML.mod3a[i],GML.mod2a[i]), GMU=c(GMU.mod1a[i],GMU.mod3a[i],GMU.mod2a[i]), HG =c(HG.mod1a[i],HG.mod3a[i],HG.mod2a[i]), HGL=c(HGL.mod1a[i],HGL.mod3a[i],HGL.mod2a[i]), HGU=c(HGU.mod1a[i],HGU.mod3a[i],HGU.mod2a[i]), row.names=c("OLS EQR v TP","geometric mean regression","OLS TP v EQR")) assign(paste("Out",i,sep=""),out) }
The following lines show output on console and then place in clipboard.
print(Out1)
write.table(Out1,"clipboard",sep="\t",row.names=F,col.names=F) # Now paste to Excel
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.